[WIP]Add Func: aclgraph_batch_size auto-adjust to different model #739
Conversation
parallel_type_cnt = 0
dp_size = self.vllm_config.parallel_config.data_parallel_size
tp_size = self.vllm_config.parallel_config.tensor_parallel_size
if dp_size > 1:
So the bigger the parallel size, the smaller the graph step? It should be bigger, right?
The number of parallel strategy types influences the length of the list: the more types of parallel strategies in use, the shorter the list becomes. However, the maximum supported batch_size value in the list remains unchanged.
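A rough sketch of that behaviour, with an illustrative thinning rule (the function name and the stride formula are assumptions, not the PR's actual code): the list gets shorter as more parallel strategy types are used, but its maximum batch size is preserved.

def thin_capture_sizes(capture_sizes, parallel_type_cnt):
    # Illustrative only: keep every (parallel_type_cnt + 1)-th entry.
    kept = capture_sizes[::parallel_type_cnt + 1]
    # The maximum supported batch size always stays in the list.
    if capture_sizes and capture_sizes[-1] not in kept:
        kept.append(capture_sizes[-1])
    return kept

sizes = list(range(1, 65))
print(len(thin_capture_sizes(sizes, 0)), max(thin_capture_sizes(sizes, 0)))  # 64 64
print(len(thin_capture_sizes(sizes, 2)), max(thin_capture_sizes(sizes, 2)))  # 22 64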
from torch_npu.op_plugin.atb._atb_ops import _register_atb_extensions
from vllm import LLM, SamplingParams

_register_atb_extensions()
what does this do?
torch_npu needs to preload ATB's .so before the dynamo trace procedure.
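A minimal sketch of the call ordering this comment describes, reusing the imports from the diff above (the prompt and sampling settings are placeholders): the registration runs before anything that could trigger dynamo tracing.

from torch_npu.op_plugin.atb._atb_ops import _register_atb_extensions
from vllm import LLM, SamplingParams

_register_atb_extensions()  # preload ATB's .so first

# Only afterwards build the engine and generate, which may trigger
# torch.compile / dynamo tracing of the model.
llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")
print(llm.generate(["Hello"], SamplingParams(max_tokens=8)))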
"_register_atb_extensions()" has been removed
"Qwen/Qwen2.5-0.5B-Instruct", | ||
] | ||
|
||
TENSOR_PARALLELS = [2] |
This is a multi-card UT; let's move it to tests/multicard to make sure it is tested as expected.
It has been moved to tests/multicard.
Please don't merge this PR; we may still need to discuss it with the torch_npu and CANN teams. This solution neither follows the CUDA behavior nor is good for performance.
Please replace all occurrences of npugraph with aclgraph.
For now, it seems we don't have much choice here: for a large model with many layers and comm groups, we may only be able to keep a small number of aclgraphs cached in memory. That means enormous padding may happen in many scenarios and thus cause performance regression. cc @wangxiyuan @Yikun
What this PR does / why we need it?
This PR adds a new function: aclgraph_batch_size can dynamically adjust to different models. Before this PR, the aclgraph_batch_sizes passed from vLLM to vLLM Ascend were always too large, which could result in an error while running different models, with the message: "The resources are insufficient".
Now, with this PR, the code can dynamically adjust aclgraph_batch_sizes depending on the model's number of hidden layers and the parallel config. For example (see the sketch below):
a. for Qwen2.5-7B, the aclgraph_batch_size list has 33 entries in total;
b. for Qwen2.5-72B, the aclgraph_batch_size list has 11 entries in total.
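A minimal sketch of the kind of adjustment described above; the function name, the stride rule, and the layer threshold are illustrative assumptions rather than the PR's actual implementation. The idea is that deeper models and more parallel strategy types leave room for fewer cached graphs, so the candidate list is thinned out while the maximum batch size is kept.

def compute_aclgraph_batch_sizes(candidate_sizes, num_hidden_layers,
                                 dp_size=1, tp_size=1):
    # Count how many parallel strategy types are active.
    parallel_type_cnt = int(dp_size > 1) + int(tp_size > 1)
    # Deeper models and more parallelism -> larger stride -> shorter list,
    # so fewer aclgraphs have to be cached in memory.
    stride = 1 + parallel_type_cnt + num_hidden_layers // 24
    kept = candidate_sizes[::stride]
    # The maximum supported batch size is always kept.
    if candidate_sizes and candidate_sizes[-1] not in kept:
        kept.append(candidate_sizes[-1])
    return kept

candidates = [1, 2, 4] + list(range(8, 513, 8))  # 67 candidate sizes
print(len(compute_aclgraph_batch_sizes(candidates, num_hidden_layers=28)))             # shallow model: longer list
print(len(compute_aclgraph_batch_sizes(candidates, num_hidden_layers=80, tp_size=2)))  # deep model + TP: shorter list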